%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% !TEX root = ../sutton_learning_1988.tex
\chapter{TD and SL approaches to prediction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{center}
  \begin{minipage}{0.5\textwidth}
    \begin{small}
      This section \ldots
    \end{small}
  \end{minipage}
  \vspace{0.5cm}
\end{center}

Historically,
the most important learning paradigm has been that of supervised learning.
In this framework the learner is asked to associate pairs of items.
When later presented with just the first item of a pair,
the learner is supposed to recall the second.
This paradigm has been used in pattern classification,
concept acquisition,
learning from examples,
system identification,
and associative memory.
For example,
in pattern classification and concept acquisition,
the first item is an instance of some pattern or concept,
and the second item is the name of that concept.
In system identification,
the learner must reproduce the input-output behavior of some unknown system.
Here,
the first item of each pair is an input
and the second is the corresponding output.
Any prediction problem can be cast in the supervised-learning paradigm
by taking the first item to be the data
on the basis of which a prediction must be made,
and the second item to be the actual outcome,
what the prediction should have been.
For example,
to predict Saturday's weather,
one can form a pair from the measurements taken on Monday
and the actual observed weather on Saturday,
another pair from the measurements taken on Tuesday and Saturday's weather,
and so on.
Although this pairwise approach ignores the sequential structure of the problem,
it is easy to understand and analyze and has been widely used.
In this paper,
we refer to this as the supervised-learning approach to prediction learning,
and we refer to learning methods that take this approach
as supervised-learning methods.
We argue that such methods are inadequate
and that \gls{td} methods are far preferable.

\section{Single-step and multi-step prediction}

To clarify this claim,
we distinguish two kinds of prediction-learning problems.
In single-step prediction problems,
all information about the correctness of each prediction is revealed at once.
In multi-step prediction problems,
correctness is not revealed
until more than one step after the prediction is made,
while partial information relevant to its correctness is revealed at each step.
For example,
the weather prediction problem mentioned above
is a multi-step prediction problem
because inconclusive evidence relevant to the correctness of Monday's prediction
becomes available in the form of new observations
on Tuesday, Wednesday, Thursday and Friday.
On the other hand,
if each day's weather were to be predicted
on the basis of the previous day's observations---that is,
on Monday predict Tuesday's weather,
on Tuesday predict Wednesday's weather, etc.
---one would have a single-step prediction problem,
assuming no further observations were made
between the time of each day's prediction
and its confirmation or refutation on the following day.

In this paper,
we will be concerned only with multi-step prediction problems.
In single-step problems,
data naturally comes in observation-outcome pairs;
these problems are ideally suited to the pairwise supervised-learning approach.
Temporal-difference methods cannot be distinguished
from supervised-learning methods in this case;
thus the former improve over conventional methods only on multi-step problems.
However,
we argue that multi-step problems predominate in real-world applications.
For example,
predictions about next year's economic performance are not
confirmed or disconfirmed all at once,
but rather bit by bit as the economic situation is observed through the year.
The likely outcome of elections is updated with each new poll,
and the likely outcome of a chess game is updated with each move.
When a baseball batter predicts whether a pitch will be a strike,
he updates his prediction continuously during the ball's flight.
In fact,
many problems that are classically cast as single-step prediction problems
are more naturally viewed as multi-step problems.
Perceptual learning problems,
such as vision or speech recognition,
are classically treated as supervised learning,
using a training set of isolated,
correctly-classified,
input patterns.
When people hear or see things,
on the other hand,
they receive a stream of input over time and
constantly update their hypotheses about what they are seeing or hearing.
People are faced not with a single-step problem of unrelated pattern-class pairs,
but rather with a series of related patterns,
all providing information about the same classification.
To disregard this structure seems improvident.

\section{Computational issues}

In this subsection,
we introduce a particular TD procedure by formally relating it to a classical supervised-learning procedure,
the Widrow-Hoff rule.
We show that the two procedures produce exactly the same weight changes,
but that the TD procedure can be implemented incrementally and therefore requires far less computational power.
In the following subsection,
this TD procedure will also serve as a conceptual bridge to a larger family of TD procedures that produce different weight changes from those of any supervised-learning method.
First,
we detail the representational assumptions that will be used throughout the paper.

We consider multi-step prediction problems in which
experience comes in observation-outcome sequences of the form
\(x_1,x_2,x_3,\ldots,x_m,z\),
where each \(x_t\) is a vector of observations available
at time \(t\) in the sequence,
and \(z\) is the outcome of the sequence.
Many such sequences will normally be experienced.
The components of each \(x_t\) are assumed to be real-valued measurements or features,
and \(z\) is assumed to be a real-valued scalar.
For each observation-outcome sequence,
the learner produces a corresponding sequence of predictions
\( P_1, P_2, P_3, \ldots, P_m \),
each of which is an estimate of \(z\).
In general,
each \(P_t\) can be a function of all preceding observation vectors up through time \(t\),
but, for simplicity,
here we assume that it is a function only of \(x_t\).
The predictions are also based on a vector of modifiable parameters or weights,
\(w\).
\(P_t\)'s functional dependence on \(x_t\)
and \(w\) will sometimes be denoted explicitly by writing it as \(P(x_t,w)\).

All learning procedures will be expressed as rules for updating \(w\).
For the moment we assume that \(w\) is updated only once
for each complete observation-outcome sequence
and thus does not change during a sequence.
For each observation,
an increment to \(w\),
denoted \(\Delta w_t\),
is determined.
After a complete sequence has been processed,
\(w\) is changed by the sum of all the sequence's increments:
\begin{equation} \label{eq:update_w}
    w \leftarrow w + \sum_{t=1}^m \Delta w_t.
\end{equation}
Later,
we will consider more incremental cases in which \(w\) is updated after each observation,
and also less incremental cases in which it is updated only after accumulating \(\Delta w_t\)'s over a training set consisting of several sequences.
The supervised-learning approach treats each sequence of observations and its outcome as a sequence of observation-outcome pairs;
that is,
as the pairs \((x_1,z)\),\((x_2,z)\),...,\((x_m,z)\).
The increment due to time \(t\) depends on the error between \(P_t\) and \(z\),
and on how changing \(w\) will affect \(P_t\).
The prototypical supervised-learning update procedure is
\begin{equation} \label{eq:compute_Delta_w_t}
	\Delta w_t = \alpha(z - P_t)\grad_w P_t,
\end{equation}
where \(\alpha\) is a positive parameter affecting the rate of learning,
and the gradient,
\(\grad_w P_t\),
is the vector of partial derivatives of \(P_t\) with respect to each component of \(w\).

For example,
consider the special case in which \(P_t\) is a linear function of \(x_t\) and \(w\),
that is,
in which \(P_t = w^Tx_t = \sum_i w(i)x_t(i)\) ,
where \(w(i)\) and \(x_t(i)\) are the ith components of \(w\) and \(x_t\) respectively.
In this case we have \(\grad_w P_t = x_t\),
and \eqref{eq:compute_Delta_w_t} reduces to the well known Widrow-Hoff rule
\citep{widrow_adaptive_1960}:
\begin{equation*}
    \Delta w_t = \alpha(z-w^Tx_t)x_t.
\end{equation*}
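As a concrete sketch of this rule (the data here are made up for illustration, and NumPy is assumed; neither is part of the original text):

```python
import numpy as np

def widrow_hoff_update(w, x, z, alpha=0.1):
    """One increment of the Widrow-Hoff rule: alpha * (z - w^T x) * x."""
    return alpha * (z - w @ x) * x

# Hypothetical observation vector and outcome.
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.2])
z = 1.0
w = w + widrow_hoff_update(w, x, z)
# The prediction w^T x has now moved toward z, shrinking the error.
```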


This linear learning method is also known as the ``delta rule'',
the ADALINE,
and the LMS filter.
It is widely used in connectionism,
pattern recognition,
signal processing,
and adaptive control.
The basic idea is that
the difference \(z-w^Tx_t\) represents
the scalar error between the prediction,
\(w^Tx_t\),
and what it should have been, \(z\).
This is multiplied by the observation vector \(x_t\)
to determine the weight changes because \(x_t\) indicates
how changing each weight will affect the error.
For example,
if the error is positive and \(x_t(i)\) is positive,
then \(w(i)\) will be increased,
increasing \(w^Tx_t\) and reducing the error.
The Widrow-Hoff rule is simple, effective, and robust.
Its theory is also better developed than that of any other learning method
(e.g., see \citealp{widrow_adaptive_1985}).

Another instance of the prototypical supervised-learning procedure
is the ``generalized delta rule,'' or backpropagation procedure,
of \citet{rumelhart_learning_1985}.
In this case,
\(P_t\) is computed by a multi-layer connectionist network and is a nonlinear function of \(x_t\) and \(w\).
Nevertheless,
the update rule used is still exactly \eqref{eq:compute_Delta_w_t},
just as in the Widrow-Hoff rule,
the only difference being that a more complicated process is used to
compute the gradient \(\grad_w P_t\).
In any case,
note that all \(\Delta w_t\) in \eqref{eq:compute_Delta_w_t} depend
critically on \(z\),
and thus cannot be determined until
the end of the sequence when \(z\) becomes known.
Thus,
all observations and predictions made during a sequence
must be remembered until its end,
when all the \(\Delta w_t\)'s are computed.
In other words,
\eqref{eq:compute_Delta_w_t} cannot be computed incrementally.
There is, however,
a TD procedure that produces exactly the same result
as \eqref{eq:compute_Delta_w_t},
and yet which can be computed incrementally.
The key is to represent the error \(z - P_t\)
as a sum of changes in predictions,
that is, as
\begin{equation*}
    z - P_t =
    \sum_{k=t}^m (P_{k+1} - P_k),
    \mbox{ where } P_{m+1} \stackrel{\mathrm{def}}{=} z.
\end{equation*}
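This telescoping identity is easy to verify numerically; the sketch below uses an arbitrary, made-up prediction sequence:

```python
# Hypothetical predictions P_1..P_m and outcome z.
P = [0.3, 0.5, 0.4, 0.8]
z = 1.0
m = len(P)
Pext = P + [z]  # append P_{m+1} := z

# For every t, z - P_t equals the sum of successive prediction changes.
for t in range(m):
    telescoped = sum(Pext[k + 1] - Pext[k] for k in range(t, m))
    assert abs((z - P[t]) - telescoped) < 1e-12
```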

Using this,
equations \eqref{eq:update_w} and \eqref{eq:compute_Delta_w_t}
can be combined as
\begin{align*}
        w \leftarrow w + \sum_{t=1}^m \alpha(z - P_t)\grad_w P_t 
    = & w + \sum_{t=1}^m \alpha \sum_{k=t}^{m} (P_{k+1}-P_k)\grad_w P_t \\
    = & w + \sum_{k=1}^m \alpha \sum_{t=1}^{k} (P_{k+1}-P_k)\grad_w P_t \\
    = & w + \sum_{t=1}^m \alpha (P_{t+1}-P_t)\sum_{k=1}^{t} \grad_w P_k.
\end{align*}
In other words,
converting back to a rule to be used with \eqref{eq:update_w}:

\begin{equation} \label{eq:compute_Delta_w_t_incremently}
	\Delta w_t = \alpha(P_{t+1} - P_t)\sum_{k=1}^t \grad_w P_k.	
\end{equation}
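The per-sequence equivalence of the two procedures can be checked directly in the linear case. The following sketch (hypothetical random data; NumPy assumed) computes both sets of increments and confirms that they agree:

```python
import numpy as np

def batch_increments(xs, z, w, alpha):
    """Supervised-learning form: sum over t of alpha * (z - P_t) * x_t."""
    return sum(alpha * (z - w @ x) * x for x in xs)

def td1_increments(xs, z, w, alpha):
    """Incremental form: alpha * (P_{t+1} - P_t) times the running gradient sum."""
    total = np.zeros_like(w)
    grad_sum = np.zeros_like(w)
    preds = [w @ x for x in xs] + [z]   # P_{m+1} := z
    for t, x in enumerate(xs):
        grad_sum = grad_sum + x         # linear case: grad_w P_t = x_t
        total += alpha * (preds[t + 1] - preds[t]) * grad_sum
    return total

rng = np.random.default_rng(0)
xs = [rng.standard_normal(4) for _ in range(5)]
w = rng.standard_normal(4)
a = batch_increments(xs, 0.7, w, 0.1)
b = td1_increments(xs, 0.7, w, 0.1)
assert np.allclose(a, b)                # identical per-sequence weight change
```

Note that `td1_increments` keeps only a single running vector of past gradients rather than the whole sequence of observations.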
Unlike \eqref{eq:compute_Delta_w_t},
this equation can be computed incrementally,
because each \(\Delta w_t\) depends only on a pair of successive predictions
and on the sum of all past values for \(\grad_w P_t\).
This saves substantially on memory,
because it is no longer necessary to
individually remember all past values of \(\grad_w P_t\).
Equation \eqref{eq:compute_Delta_w_t_incremently} also makes
much milder demands on the computational speed of the device that implements it;
although it requires slightly more arithmetic operations overall
(the additional ones are those needed to accumulate \(\sum_{k=1}^t \grad_w P_k\)),
they can be distributed over time more evenly.
Whereas \eqref{eq:compute_Delta_w_t_incremently}
computes one increment to \(w\) on each time step,
\eqref{eq:compute_Delta_w_t} must wait until a sequence is completed
and then compute all of the increments due to that sequence.
If \(M\) is the maximum possible length of a sequence,
then under many circumstances
\eqref{eq:compute_Delta_w_t_incremently} will require only
\(1/M\)th of the memory and speed required by \eqref{eq:compute_Delta_w_t}.

For reasons that will be made clear shortly,
we refer to the procedure given by \eqref{eq:compute_Delta_w_t_incremently}
as the TD(1) procedure.
In addition,
we will refer to a procedure as linear if its predictions \(P_t\) are a linear function of the observation vectors \(x_t\) and the vector of memory parameters \(w\),
that is,
if \(P_t = w^Tx_t\).
We have just proven:
\begin{thm}\label{thm:incremental_widrow_hoff}
    On multi-step prediction problems,
    the linear TD(1) procedure produces the same per-sequence weight changes
    as the Widrow-Hoff procedure.
\end{thm}

Next,
we introduce a family of TD procedures
that produce weight changes different from
those of any supervised-learning procedure.
\section{The TD(λ) family of learning procedures}%
\label{sec:2_3}

The hallmark of temporal-difference methods is their sensitivity to changes in successive predictions rather than to overall error between predictions and the final outcome.
In response to an increase (decrease) in prediction
from \(P_t\) to \(P_{t+1}\),
an increment \(\Delta w_t\) is determined that increases (decreases)
the predictions for some or all of the preceding observation vectors
\(x_1,\ldots,x_t\).
The procedure given by \eqref{eq:compute_Delta_w_t_incremently}
is the special case in which
all of those predictions are altered to an equal extent.
In this article we also consider a class of TD procedures that
make greater alterations to more recent predictions.
In particular,
we consider an exponential weighting with recency,
in which alterations to the predictions of observation vectors
occurring k steps in the past are weighted
according to \(\lambda^k\) for \(0\leq\lambda\leq 1\):
\begin{equation} \label{eq:compute_Delta_w_t_discounted_incremently}
	\Delta w_t =
    \alpha(P_{t+1} - P_t)\sum_{k=1}^t \lambda^{t-k} \grad_w P_k.	
\end{equation}
Note that for \(\lambda = 1\)
this is equivalent to \eqref{eq:compute_Delta_w_t_incremently},
the TD implementation of the prototypical supervised-learning method.
Accordingly,
we call this new procedure TD(\(\lambda\))
and the procedure given by \eqref{eq:compute_Delta_w_t_incremently},
TD(1).
Alterations of past predictions can be weighted in ways
other than the exponential form given above,
and this may be appropriate for particular applications.
However,
an important advantage to the exponential form is that
it can be computed incrementally.
Letting \(e_t\) denote the value of the sum in
\eqref{eq:compute_Delta_w_t_discounted_incremently} at time \(t\),
we can compute \(e_{t+1}\) incrementally,
using only current information, as
\begin{align*}
      e_{t+1}
    = & \sum_{k=1}^{t+1} \lambda^{t+1-k} \grad_w P_k \\
    = & \grad_w P_{t+1} + \sum_{k=1}^t \lambda^{t+1-k} \grad_w P_k \\
    = & \grad_w P_{t+1} + \lambda e_t.
\end{align*}
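A minimal sketch of the resulting TD(\(\lambda\)) computation in the linear case, maintaining the running sum \(e_t\) rather than all past gradients (the data below are hypothetical, and NumPy is assumed):

```python
import numpy as np

def td_lambda_increments(xs, z, w, alpha, lam):
    """Per-sequence TD(lambda) increments, using e_{t+1} = grad P_{t+1} + lam * e_t."""
    e = np.zeros_like(w)
    total = np.zeros_like(w)
    preds = [w @ x for x in xs] + [z]   # P_{m+1} := z
    for t, x in enumerate(xs):
        e = x + lam * e                 # linear case: grad_w P_t = x_t
        total += alpha * (preds[t + 1] - preds[t]) * e
    return total

rng = np.random.default_rng(1)
xs = [rng.standard_normal(3) for _ in range(4)]
w = rng.standard_normal(3)
z = 0.5
# For lambda = 1 this reproduces the Widrow-Hoff per-sequence change.
inc1 = td_lambda_increments(xs, z, w, 0.1, 1.0)
wh = sum(0.1 * (z - w @ x) * x for x in xs)
assert np.allclose(inc1, wh)
```

For \(\lambda < 1\) the same loop yields the genuinely different weight changes discussed next; only the decay rate of the vector `e` changes.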

For \(\lambda < 1\),
TD(\(\lambda\)) produces weight changes
different from those made by any supervised-learning method.
The difference is greatest in the case of TD(0) (where \(\lambda = 0\)),
in which the weight increment is determined only by its effect on the prediction associated with the most recent observation:
\begin{equation*}
    \Delta w_t = \alpha(P_{t+1} - P_t)\grad_w P_t.
\end{equation*}
Note that this procedure is \emph{formally}~very similar
to the prototypical supervised-learning procedure \eqref{eq:compute_Delta_w_t}.
The two equations are identical except that
the actual outcome \(z\) in \eqref{eq:compute_Delta_w_t} is replaced
by the next prediction \(P_{t+1}\) in the equation above.
The two methods use the same learning \emph{mechanism},
but with different errors.
Because of these relationships and TD(0)'s overall simplicity,
it is an important focus here.
